
kClust: Fast and sensitive clustering of large protein sequence databases.



Abstract

Background: Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching these ever-larger and more redundant databases is becoming increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.

Results: Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%–30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%–30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.

Conclusions: kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at http://toolkit.lmb.uni-muenchen.de/pub/kClust/.
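To make the idea of an alignment-free k-mer prefilter more concrete, the following is a minimal Python sketch. It is not kClust's implementation: it assumes exact k-mer matches with a score of 1 per shared k-mer, whereas the abstract describes a cumulative score over similar 6-mers. The function names (`kmers`, `build_index`, `prefilter`), the threshold `min_score`, and the toy sequences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(seqs, k):
    """Map each k-mer to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for sid, seq in seqs.items():
        for km in kmers(seq, k):
            index[km].add(sid)
    return index


def prefilter(query, index, k, min_score):
    """Return {sequence id: score} for sequences sharing enough k-mers.

    ASSUMPTION: the score is the number of distinct k-mers shared exactly
    with the query. This stands in for the similarity-weighted cumulative
    score mentioned in the abstract and is far less sensitive.
    """
    scores = defaultdict(int)
    for km in set(kmers(query, k)):
        # .get avoids inserting empty entries into the defaultdict index
        for sid in index.get(km, ()):
            scores[sid] += 1
    return {sid: s for sid, s in scores.items() if s >= min_score}


# Toy data (hypothetical sequences, not taken from any database)
seqs = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQL",
    "c": "GGGGGGGGGG",
}
index = build_index(seqs, k=6)
candidates = prefilter(seqs["a"], index, k=6, min_score=4)
print(candidates)
```

Because the index is queried per k-mer, only sequences that share at least one k-mer with the query are ever scored. In this sketch, that is how a prefilter avoids running a full alignment on every pair; only the surviving candidates would be passed on to a more expensive alignment step.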
